Repeat the analysis shown in the R lab of this chapter, but use TOTAL13 as the outcome variable. Please build both the regression model and the decision tree model (for regression). Identify the final models you would select, evaluate the models, and compare the regression model with the tree model.
library(RCurl)
## Loading required package: bitops
AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_498/master/resource/data/AD2.csv"))
AD$ID = c(1:dim(AD)[1])
str(AD)
## 'data.frame': 517 obs. of 18 variables:
## $ AGE : num 71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
## $ PTGENDER : int 2 1 2 1 1 2 2 2 1 2 ...
## $ PTEDUCAT : int 14 18 18 13 13 20 20 18 19 18 ...
## $ FDG : num 6.82 6.37 6.37 6.37 6.37 ...
## $ AV45 : num 1.11 1.11 1.11 1.11 1.11 ...
## $ HippoNV : num 0.529 0.538 0.269 0.576 0.601 ...
## $ e2_1 : int 1 0 0 0 1 0 0 0 0 1 ...
## $ e4_1 : int 0 0 1 0 0 1 1 1 1 1 ...
## $ rs3818361 : int 1 1 1 1 1 1 1 1 0 0 ...
## $ rs744373 : int 1 0 1 1 1 0 1 1 0 1 ...
## $ rs11136000: int 1 1 1 1 1 0 0 1 0 0 ...
## $ rs610932 : int 1 1 0 1 0 1 1 1 0 1 ...
## $ rs3851179 : int 1 0 1 0 0 1 0 0 1 0 ...
## $ rs3764650 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rs3865444 : int 0 1 1 0 0 0 1 1 1 0 ...
## $ MMSCORE : int 26 30 30 28 29 30 30 27 28 30 ...
## $ TOTAL13 : num 8 1.67 12 3 10 3.67 4 11 3 9 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
AD_demo <- subset(AD, select=c("TOTAL13", "AGE","PTGENDER","PTEDUCAT","ID"))
str(AD_demo)
## 'data.frame': 517 obs. of 5 variables:
## $ TOTAL13 : num 8 1.67 12 3 10 3.67 4 11 3 9 ...
## $ AGE : num 71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
## $ PTGENDER: int 2 1 2 1 1 2 2 2 1 2 ...
## $ PTEDUCAT: int 14 18 18 13 13 20 20 18 19 18 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
library(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=4)
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)
The scatterplot “TOTAL13 versus PTEDUCAT” shows only a weak relationship between PTEDUCAT and TOTAL13.
library(ggplot2)
p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=4)
p <- p + labs(title="TOTAL13 versus AGE")
print(p)
The scatterplot “TOTAL13 versus AGE” shows a weak positive relationship between AGE and TOTAL13.
# fit a simple linear regression model with AGE
library(car)
lm.AD_demo <- lm(TOTAL13 ~ AGE, data = AD_demo)
summary(lm.AD_demo)
##
## Call:
## lm(formula = TOTAL13 ~ AGE, data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.777 -5.598 -1.834 4.063 39.299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2456 3.5461 0.069 0.944819
## AGE 0.1887 0.0486 3.884 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.913 on 515 degrees of freedom
## Multiple R-squared: 0.02845, Adjusted R-squared: 0.02657
## F-statistic: 15.08 on 1 and 515 DF, p-value: 0.0001164
‘AGE’ is statistically significant with a p-value of 0.000116, so we reject the null hypothesis (H0: there is no relationship between TOTAL13 and AGE). However, R-squared is only 0.02845, indicating that AGE alone explains only about 2.8% of the variability in TOTAL13.
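As a complement to the p-value, a 95% confidence interval for the AGE coefficient can be obtained with confint(); an interval that excludes 0 is consistent with rejecting H0. A minimal sketch, run in the same session where lm.AD_demo was fitted:

```r
# 95% confidence intervals for the intercept and the AGE slope;
# the AGE interval excluding 0 agrees with the small p-value above
confint(lm.AD_demo, level = 0.95)
```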
lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
##
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.691 -5.450 -1.972 4.124 38.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.52738 4.21805 1.073 0.28363
## AGE 0.16025 0.04867 3.292 0.00106 **
## PTGENDER 2.18848 0.71130 3.077 0.00220 **
## PTEDUCAT -0.34519 0.13026 -2.650 0.00830 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.825 on 513 degrees of freedom
## Multiple R-squared: 0.05359, Adjusted R-squared: 0.04806
## F-statistic: 9.684 on 3 and 513 DF, p-value: 3.164e-06
With all three demographic variables included in the model, the R-squared value increases to 0.05359, indicating that 5.4% of the variability in TOTAL13 can be explained by the three variables. All three variables are significant, with p-values of 0.00106, 0.00220, and 0.00830, each less than 0.01.
require(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(aes(colour=AGE), size=2)
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)
To examine whether the relationship between TOTAL13 and PTEDUCAT changes with AGE, we can draw the same scatterplot for two age groups.
p <- ggplot(AD_demo[which(AD_demo$AGE < 60),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE < 60")
print(p)
p <- ggplot(AD_demo[which(AD_demo$AGE > 80),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE > 80")
print(p)
TOTAL13 shows a very weak positive correlation with PTEDUCAT when AGE < 60. On the other hand, TOTAL13 is negatively correlated with PTEDUCAT for patients older than 80.
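The visual impression can be checked numerically with cor() on the same two age strata (a sketch assuming AD_demo is in the session; the exact values depend on the data, so none are shown here):

```r
# Pearson correlation between PTEDUCAT and TOTAL13 within each age stratum;
# a sign change across strata suggests an AGE-by-PTEDUCAT interaction
with(AD_demo[AD_demo$AGE < 60, ], cor(PTEDUCAT, TOTAL13))
with(AD_demo[AD_demo$AGE > 80, ], cor(PTEDUCAT, TOTAL13))
```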
p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age")
print(p)
#for male
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 1),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age - male")
print(p)
# for female
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 2),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + labs(title="TOTAL13 versus Age - female")
print(p)
The scatterplot of TOTAL13 versus AGE for males is more spread out than that for females.
# including interaction term: AGE*PTEDUCAT
lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE*PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
##
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE * PTEDUCAT,
## data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.259 -5.233 -1.888 3.890 38.180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.61428 20.96925 -0.935 0.35003
## AGE 0.48724 0.28244 1.725 0.08511 .
## PTGENDER 2.25736 0.71344 3.164 0.00165 **
## PTEDUCAT 1.15343 1.28174 0.900 0.36860
## AGE:PTEDUCAT -0.02042 0.01737 -1.175 0.24042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.822 on 512 degrees of freedom
## Multiple R-squared: 0.05614, Adjusted R-squared: 0.04877
## F-statistic: 7.613 on 4 and 512 DF, p-value: 5.795e-06
Including the interaction term increases the R-squared value from 0.05359 (the model without the interaction term) to 0.05614. However, the interaction term is not statistically significant: its p-value is 0.24, greater than 0.05, so we cannot reject the null hypothesis that its coefficient is zero.
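The same conclusion can be reached with a partial F-test comparing the models with and without the interaction. A sketch (lm.additive is a hypothetical name for the refitted additive model; lm.AD_demo2 is the interaction model fitted above, which overwrote the earlier additive fit):

```r
# refit the additive model and compare it with the interaction model;
# for a single added term, the F-test p-value matches the t-test above
lm.additive <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
anova(lm.additive, lm.AD_demo2)
```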
# Diagnostics graphs:
require("ggfortify")
## Loading required package: ggfortify
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'quiet'
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'quiet'
autoplot(lm.AD_demo2, which = 1:6, ncol = 3, label.size = 3)
The residuals-versus-fitted plot is used to detect non-linearity, unequal error variances, and outliers. Our plot shows no pronounced patterns, indicating a reasonable fit, with the residuals scattered around the zero line. The Scale-Location and standardized-residual plots show no significant patterns either.
The normal Q-Q plot shows whether the residuals follow a theoretical distribution (e.g., normal or exponential). Our plot is not perfectly straight, suggesting there is room to improve the model.
We then added more predictors to try to improve the model:
# try full-scale model - exclude MMSCORE as it is other output
AD_full <- AD[,c(1:17)]
AD_full <- subset(AD_full, select = -c(MMSCORE) )
names(AD_full)
## [1] "AGE" "PTGENDER" "PTEDUCAT" "FDG" "AV45"
## [6] "HippoNV" "e2_1" "e4_1" "rs3818361" "rs744373"
## [11] "rs11136000" "rs610932" "rs3851179" "rs3764650" "rs3865444"
## [16] "TOTAL13"
lm.AD <- lm(TOTAL13 ~ ., data = AD_full)
summary(lm.AD)
##
## Call:
## lm(formula = TOTAL13 ~ ., data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4440 -4.1462 -0.4275 3.7954 24.8925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.83420 5.91765 9.097 < 2e-16 ***
## AGE -0.03661 0.04175 -0.877 0.380900
## PTGENDER 0.76054 0.57025 1.334 0.182908
## PTEDUCAT -0.34366 0.10275 -3.345 0.000886 ***
## FDG -3.89724 0.46608 -8.362 6.15e-16 ***
## AV45 7.06724 1.49597 4.724 3.01e-06 ***
## HippoNV -36.24541 4.20495 -8.620 < 2e-16 ***
## e2_1 -0.33649 0.95733 -0.351 0.725373
## e4_1 -0.04923 0.61540 -0.080 0.936268
## rs3818361 -0.61150 0.57083 -1.071 0.284570
## rs744373 -0.18606 0.54306 -0.343 0.732032
## rs11136000 -0.32233 0.57256 -0.563 0.573713
## rs610932 1.13649 0.56506 2.011 0.044832 *
## rs3851179 -0.33988 0.54648 -0.622 0.534264
## rs3764650 0.65356 0.68652 0.952 0.341558
## rs3865444 0.93384 0.53949 1.731 0.084072 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.079 on 501 degrees of freedom
## Multiple R-squared: 0.4423, Adjusted R-squared: 0.4256
## F-statistic: 26.48 on 15 and 501 DF, p-value: < 2.2e-16
To predict TOTAL13, a full model was built with all the demographic, genetic, and imaging variables. PTEDUCAT, FDG, AV45, HippoNV, and rs610932 are significant based on their p-values. R-squared increases to 0.4423, indicating 44% of the variability in TOTAL13 can be explained by these variables.
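With 15 predictors in the model, the individual p-values are only trustworthy if the predictors are not strongly collinear. Variance inflation factors from car::vif() (the car package is already loaded above) give a quick check; values well above 5–10 would be a concern. A minimal sketch:

```r
# variance inflation factors for the full model; a large VIF flags a
# predictor that is nearly a linear combination of the other predictors
library(car)
vif(lm.AD)
```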
# try taking AGE out to find differences
lm.AD.reduced <- lm.AD;
lm.AD.reduced <- update(lm.AD.reduced, ~ . - AGE);
summary(lm.AD.reduced);
##
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9824 -4.1548 -0.4448 3.7769 25.3646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.57095 4.60043 10.993 < 2e-16 ***
## PTGENDER 0.73102 0.56912 1.284 0.19957
## PTEDUCAT -0.33283 0.10198 -3.264 0.00117 **
## FDG -3.90679 0.46584 -8.387 5.10e-16 ***
## AV45 6.94916 1.48956 4.665 3.96e-06 ***
## HippoNV -34.92868 3.92688 -8.895 < 2e-16 ***
## e2_1 -0.32283 0.95699 -0.337 0.73600
## e4_1 0.06169 0.60213 0.102 0.91844
## rs3818361 -0.59838 0.57050 -1.049 0.29474
## rs744373 -0.18304 0.54292 -0.337 0.73616
## rs11136000 -0.33890 0.57211 -0.592 0.55387
## rs610932 1.15310 0.56461 2.042 0.04165 *
## rs3851179 -0.33920 0.54636 -0.621 0.53498
## rs3764650 0.64759 0.68633 0.944 0.34585
## rs3865444 0.93810 0.53934 1.739 0.08259 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.078 on 502 degrees of freedom
## Multiple R-squared: 0.4414, Adjusted R-squared: 0.4258
## F-statistic: 28.33 on 14 and 502 DF, p-value: < 2.2e-16
The R-squared value decreases only slightly, from 0.4423 to 0.4414, so removing AGE has almost no effect. We can compare the full model with this reduced model using an F-test via the anova() function.
anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 +
## e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 502 18542
## 2 501 18514 1 28.422 0.7692 0.3809
By the F-test, the p-value is 0.3809 (F = 0.7692), which indicates that the two models are statistically indistinguishable. We then removed the least significant remaining predictor, e4_1.
lm.AD.reduced <- update(lm.AD.reduced, ~ . - e4_1);
summary(lm.AD.reduced);
##
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.945 -4.170 -0.439 3.767 25.314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.5787 4.5953 11.007 < 2e-16 ***
## PTGENDER 0.7292 0.5683 1.283 0.20001
## PTEDUCAT -0.3324 0.1018 -3.265 0.00117 **
## FDG -3.9138 0.4603 -8.503 < 2e-16 ***
## AV45 6.9930 1.4254 4.906 1.26e-06 ***
## HippoNV -34.9195 3.9220 -8.904 < 2e-16 ***
## e2_1 -0.3387 0.9434 -0.359 0.71970
## rs3818361 -0.5944 0.5686 -1.045 0.29636
## rs744373 -0.1811 0.5421 -0.334 0.73841
## rs11136000 -0.3375 0.5714 -0.591 0.55496
## rs610932 1.1536 0.5640 2.045 0.04135 *
## rs3851179 -0.3403 0.5457 -0.624 0.53314
## rs3764650 0.6482 0.6856 0.945 0.34491
## rs3865444 0.9398 0.5386 1.745 0.08159 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.072 on 503 degrees of freedom
## Multiple R-squared: 0.4414, Adjusted R-squared: 0.4269
## F-statistic: 30.57 on 13 and 503 DF, p-value: < 2.2e-16
anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 +
## rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 503 18542
## 2 501 18514 2 28.81 0.3898 0.6774
We can repeat this process until no more variables can be deleted, or use the step() function to automate it.
# model selection
lm.AD.F <- step(lm.AD, direction="backward", test="F")
## Start: AIC=1881.94
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - e4_1 1 0.24 18514 1879.9 0.0064 0.9362678
## - rs744373 1 4.34 18518 1880.1 0.1174 0.7320322
## - e2_1 1 4.57 18518 1880.1 0.1235 0.7253731
## - rs11136000 1 11.71 18525 1880.3 0.3169 0.5737126
## - rs3851179 1 14.29 18528 1880.3 0.3868 0.5342644
## - AGE 1 28.42 18542 1880.7 0.7692 0.3809001
## - rs3764650 1 33.49 18547 1880.9 0.9063 0.3415580
## - rs3818361 1 42.41 18556 1881.1 1.1476 0.2845704
## - PTGENDER 1 65.73 18579 1881.8 1.7788 0.1829079
## <none> 18514 1881.9
## - rs3865444 1 110.72 18624 1883.0 2.9963 0.0840724 .
## - rs610932 1 149.48 18663 1884.1 4.0452 0.0448324 *
## - PTEDUCAT 1 413.38 18927 1891.3 11.1867 0.0008856 ***
## - AV45 1 824.71 19338 1902.5 22.3178 3.005e-06 ***
## - FDG 1 2583.75 21097 1947.5 69.9200 6.153e-16 ***
## - HippoNV 1 2745.58 21259 1951.4 74.2992 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1879.94
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - e2_1 1 4.35 18518 1878.1 0.1179 0.7314465
## - rs744373 1 4.41 18518 1878.1 0.1195 0.7296991
## - rs11136000 1 11.82 18526 1878.3 0.3205 0.5715653
## - rs3851179 1 14.23 18528 1878.3 0.3857 0.5348293
## - AGE 1 28.57 18542 1878.7 0.7748 0.3791652
## - rs3764650 1 33.44 18547 1878.9 0.9066 0.3414752
## - rs3818361 1 42.95 18557 1879.1 1.1647 0.2810068
## - PTGENDER 1 65.89 18580 1879.8 1.7866 0.1819460
## <none> 18514 1879.9
## - rs3865444 1 110.52 18624 1881.0 2.9968 0.0840459 .
## - rs610932 1 149.46 18663 1882.1 4.0526 0.0446365 *
## - PTEDUCAT 1 413.67 18927 1889.4 11.2166 0.0008716 ***
## - AV45 1 896.25 19410 1902.4 24.3019 1.121e-06 ***
## - FDG 1 2628.05 21142 1946.6 71.2598 3.378e-16 ***
## - HippoNV 1 2750.50 21264 1949.5 74.5799 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1878.06
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs744373 1 4.12 18522 1876.2 0.1120 0.7379869
## - rs11136000 1 11.90 18530 1876.4 0.3232 0.5699302
## - rs3851179 1 12.73 18531 1876.4 0.3457 0.5567995
## - AGE 1 28.98 18547 1876.9 0.7871 0.3754009
## - rs3764650 1 33.82 18552 1877.0 0.9187 0.3382775
## - rs3818361 1 41.72 18560 1877.2 1.1332 0.2876095
## - PTGENDER 1 65.01 18583 1877.9 1.7658 0.1845085
## <none> 18518 1878.1
## - rs3865444 1 110.49 18628 1879.1 3.0012 0.0838175 .
## - rs610932 1 150.64 18669 1880.2 4.0918 0.0436207 *
## - PTEDUCAT 1 412.99 18931 1887.5 11.2179 0.0008709 ***
## - AV45 1 917.13 19435 1901.0 24.9117 8.285e-07 ***
## - FDG 1 2626.99 21145 1944.7 71.3560 3.223e-16 ***
## - HippoNV 1 2767.45 21286 1948.1 75.1714 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1876.18
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## rs3818361 + rs11136000 + rs610932 + rs3851179 + rs3764650 +
## rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3851179 1 11.94 18534 1874.5 0.3250 0.5688721
## - rs11136000 1 11.98 18534 1874.5 0.3260 0.5682597
## - AGE 1 28.68 18551 1875.0 0.7804 0.3774352
## - rs3764650 1 32.94 18555 1875.1 0.8964 0.3442065
## - rs3818361 1 42.02 18564 1875.3 1.1434 0.2854451
## - PTGENDER 1 65.45 18588 1876.0 1.7810 0.1826239
## <none> 18522 1876.2
## - rs3865444 1 112.37 18634 1877.3 3.0576 0.0809696 .
## - rs610932 1 149.72 18672 1878.3 4.0741 0.0440754 *
## - PTEDUCAT 1 408.92 18931 1885.5 11.1269 0.0009136 ***
## - AV45 1 922.51 19445 1899.3 25.1021 7.537e-07 ***
## - FDG 1 2627.13 21149 1942.8 71.4857 3.029e-16 ***
## - HippoNV 1 2765.66 21288 1946.1 75.2552 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1874.51
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## rs3818361 + rs11136000 + rs610932 + rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs11136000 1 10.75 18545 1872.8 0.2928 0.5886675
## - AGE 1 28.74 18563 1873.3 0.7831 0.3766019
## - rs3764650 1 31.87 18566 1873.4 0.8684 0.3518371
## - rs3818361 1 38.98 18573 1873.6 1.0621 0.3032322
## - PTGENDER 1 65.48 18600 1874.3 1.7842 0.1822318
## <none> 18534 1874.5
## - rs3865444 1 111.00 18645 1875.6 3.0245 0.0826260 .
## - rs610932 1 153.95 18688 1876.8 4.1947 0.0410676 *
## - PTEDUCAT 1 406.39 18940 1883.7 11.0728 0.0009399 ***
## - AV45 1 936.07 19470 1898.0 25.5052 6.173e-07 ***
## - FDG 1 2616.06 21150 1940.8 71.2800 3.305e-16 ***
## - HippoNV 1 2782.84 21317 1944.8 75.8241 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1872.81
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## rs3818361 + rs610932 + rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - AGE 1 29.74 18575 1871.6 0.8115 0.368107
## - rs3764650 1 32.16 18577 1871.7 0.8775 0.349347
## - rs3818361 1 34.99 18580 1871.8 0.9548 0.328962
## - PTGENDER 1 64.88 18610 1872.6 1.7703 0.183940
## <none> 18545 1872.8
## - rs3865444 1 109.13 18654 1873.8 2.9777 0.085032 .
## - rs610932 1 154.78 18700 1875.1 4.2233 0.040385 *
## - PTEDUCAT 1 403.69 18949 1881.9 11.0147 0.000969 ***
## - AV45 1 971.95 19517 1897.2 26.5197 3.743e-07 ***
## - FDG 1 2608.87 21154 1938.9 71.1834 3.435e-16 ***
## - HippoNV 1 2787.67 21332 1943.2 76.0621 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1871.64
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs3818361 +
## rs610932 + rs3764650 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3764650 1 31.73 18606 1870.5 0.8661 0.352482
## - rs3818361 1 32.25 18607 1870.5 0.8804 0.348546
## - PTGENDER 1 59.54 18634 1871.3 1.6252 0.202946
## <none> 18575 1871.6
## - rs3865444 1 110.73 18685 1872.7 3.0224 0.082730 .
## - rs610932 1 159.88 18734 1874.1 4.3639 0.037207 *
## - PTEDUCAT 1 382.68 18957 1880.2 10.4455 0.001309 **
## - AV45 1 963.61 19538 1895.8 26.3020 4.163e-07 ***
## - FDG 1 2645.81 21220 1938.5 72.2183 < 2.2e-16 ***
## - HippoNV 1 2957.93 21532 1946.0 80.7377 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1870.52
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs3818361 +
## rs610932 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3818361 1 34.38 18641 1869.5 0.9387 0.33306
## - PTGENDER 1 60.70 18667 1870.2 1.6573 0.19855
## <none> 18606 1870.5
## - rs3865444 1 112.69 18719 1871.6 3.0766 0.08003 .
## - rs610932 1 170.43 18777 1873.2 4.6532 0.03146 *
## - PTEDUCAT 1 373.19 18980 1878.8 10.1890 0.00150 **
## - AV45 1 961.94 19568 1894.6 26.2633 4.240e-07 ***
## - FDG 1 2636.85 21243 1937.0 71.9926 2.378e-16 ***
## - HippoNV 1 2947.01 21553 1944.5 80.4608 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1869.48
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs610932 +
## rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - PTGENDER 1 59.51 18700 1869.1 1.6251 0.202963
## <none> 18641 1869.5
## - rs3865444 1 118.62 18759 1870.8 3.2391 0.072494 .
## - rs610932 1 165.19 18806 1872.0 4.5106 0.034168 *
## - PTEDUCAT 1 361.40 19002 1877.4 9.8682 0.001779 **
## - AV45 1 951.70 19592 1893.2 25.9870 4.855e-07 ***
## - FDG 1 2661.14 21302 1936.5 72.6645 < 2.2e-16 ***
## - HippoNV 1 2926.99 21568 1942.9 79.9238 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1869.12
## TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 18700 1869.1
## - rs3865444 1 132.2 18832 1870.8 3.6041 0.058203 .
## - rs610932 1 154.5 18855 1871.4 4.2137 0.040610 *
## - PTEDUCAT 1 317.9 19018 1875.8 8.6699 0.003383 **
## - AV45 1 907.3 19608 1891.6 24.7451 8.958e-07 ***
## - FDG 1 2768.2 21468 1938.5 75.4946 < 2.2e-16 ***
## - HippoNV 1 3216.4 21917 1949.2 87.7196 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.AD.F)
##
## Call:
## lm(formula = TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 +
## rs3865444, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7759 -4.3069 -0.4518 3.9067 25.5206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.19143 4.23927 12.076 < 2e-16 ***
## PTEDUCAT -0.29086 0.09878 -2.944 0.00338 **
## FDG -3.95981 0.45574 -8.689 < 2e-16 ***
## AV45 6.92069 1.39125 4.974 8.96e-07 ***
## HippoNV -35.82765 3.82534 -9.366 < 2e-16 ***
## rs610932 1.14763 0.55908 2.053 0.04061 *
## rs3865444 1.01433 0.53430 1.898 0.05820 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.055 on 510 degrees of freedom
## Multiple R-squared: 0.4366, Adjusted R-squared: 0.43
## F-statistic: 65.88 on 6 and 510 DF, p-value: < 2.2e-16
anova(lm.AD.F ,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 510 18700
## 2 501 18514 9 186.77 0.5616 0.8287
The final model with 6 predictors has an R-squared of 0.4366, not far from the 0.4423 achieved by the model using all predictors. An F-test via the anova() function confirms that the two models are statistically indistinguishable, with a p-value of 0.8287.
# Diagnostics graphs:
library("ggfortify")
autoplot(lm.AD.F, which = 1:6, ncol = 3, label.size = 3)
The residuals-versus-fitted and Scale-Location plots still show no significant patterns, indicating the linearity assumption is not seriously violated. The normal Q-Q plot for the new model is much improved over the earlier model with only demographic predictors.
Therefore, we select the model with the 6 predictors (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444) as the final model.
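To evaluate the selected model beyond its in-sample R-squared, a simple holdout check can be run (a sketch; the 50/50 split and seed are arbitrary choices of ours, and the resulting RMSE will vary with them):

```r
set.seed(1)
idx <- sample(nrow(AD_full), floor(nrow(AD_full) / 2))
# refit the 6-predictor model on half the data
lm.final <- lm(TOTAL13 ~ PTEDUCAT + FDG + AV45 + HippoNV + rs610932 + rs3865444,
               data = AD_full[idx, ])
# root mean squared error on the held-out half
pred <- predict(lm.final, AD_full[-idx, ])
sqrt(mean((pred - AD_full$TOTAL13[-idx])^2))
```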
# Evaluate the variable importance by all subsets regression
# install.packages("leaps")
library(leaps)
leaps<-regsubsets(TOTAL13 ~ ., data = AD_full,nbest=4)
# view results
summary(leaps)
## Subset selection object
## Call: regsubsets.formula(TOTAL13 ~ ., data = AD_full, nbest = 4)
## 15 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## PTGENDER FALSE FALSE
## PTEDUCAT FALSE FALSE
## FDG FALSE FALSE
## AV45 FALSE FALSE
## HippoNV FALSE FALSE
## e2_1 FALSE FALSE
## e4_1 FALSE FALSE
## rs3818361 FALSE FALSE
## rs744373 FALSE FALSE
## rs11136000 FALSE FALSE
## rs610932 FALSE FALSE
## rs3851179 FALSE FALSE
## rs3764650 FALSE FALSE
## rs3865444 FALSE FALSE
## 4 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AGE PTGENDER PTEDUCAT FDG AV45 HippoNV e2_1 e4_1 rs3818361
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " " "*" " " " " " "
## 1 ( 3 ) " " " " " " " " "*" " " " " " " " "
## 1 ( 4 ) " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " "*" " " "*" " " " " " "
## 2 ( 2 ) " " " " " " " " "*" "*" " " " " " "
## 2 ( 3 ) " " " " " " "*" "*" " " " " " " " "
## 2 ( 4 ) "*" " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" "*" "*" " " " " " "
## 3 ( 2 ) " " " " "*" "*" " " "*" " " " " " "
## 3 ( 3 ) " " " " " " "*" " " "*" " " " " " "
## 3 ( 4 ) " " " " " " "*" " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" "*" "*" " " " " " "
## 4 ( 2 ) " " " " " " "*" "*" "*" " " " " " "
## 4 ( 3 ) " " " " " " "*" "*" "*" " " " " " "
## 4 ( 4 ) " " " " " " "*" "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" "*" "*" " " " " " "
## 5 ( 2 ) " " " " "*" "*" "*" "*" " " " " " "
## 5 ( 3 ) " " "*" "*" "*" "*" "*" " " " " " "
## 5 ( 4 ) " " " " "*" "*" "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" "*" "*" " " " " " "
## 6 ( 2 ) " " "*" "*" "*" "*" "*" " " " " " "
## 6 ( 3 ) " " " " "*" "*" "*" "*" " " " " "*"
## 6 ( 4 ) " " " " "*" "*" "*" "*" " " " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" "*" " " " " " "
## 7 ( 2 ) " " " " "*" "*" "*" "*" " " " " " "
## 7 ( 3 ) " " " " "*" "*" "*" "*" " " " " "*"
## 7 ( 4 ) "*" " " "*" "*" "*" "*" " " " " " "
## 8 ( 1 ) " " "*" "*" "*" "*" "*" " " " " "*"
## 8 ( 2 ) " " "*" "*" "*" "*" "*" " " " " " "
## 8 ( 3 ) "*" "*" "*" "*" "*" "*" " " " " " "
## 8 ( 4 ) " " "*" "*" "*" "*" "*" " " " " " "
## rs744373 rs11136000 rs610932 rs3851179 rs3764650 rs3865444
## 1 ( 1 ) " " " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " " " "
## 3 ( 3 ) " " " " "*" " " " " " "
## 3 ( 4 ) " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " " "
## 4 ( 2 ) " " " " " " " " " " "*"
## 4 ( 3 ) " " " " "*" " " " " " "
## 4 ( 4 ) " " " " " " " " "*" " "
## 5 ( 1 ) " " " " "*" " " " " " "
## 5 ( 2 ) " " " " " " " " " " "*"
## 5 ( 3 ) " " " " " " " " " " " "
## 5 ( 4 ) " " " " " " " " "*" " "
## 6 ( 1 ) " " " " "*" " " " " "*"
## 6 ( 2 ) " " " " "*" " " " " " "
## 6 ( 3 ) " " " " "*" " " " " " "
## 6 ( 4 ) " " " " "*" " " "*" " "
## 7 ( 1 ) " " " " "*" " " " " "*"
## 7 ( 2 ) " " " " "*" " " "*" "*"
## 7 ( 3 ) " " " " "*" " " " " "*"
## 7 ( 4 ) " " " " "*" " " " " "*"
## 8 ( 1 ) " " " " "*" " " " " "*"
## 8 ( 2 ) " " " " "*" " " "*" "*"
## 8 ( 3 ) " " " " "*" " " " " "*"
## 8 ( 4 ) " " "*" "*" " " " " "*"
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")
The regsubsets() output shows which models achieve the highest R-squared values and which variables appear most frequently in them. From the plot, the highest R-squared is achieved by the model with 8 predictors: PTGENDER, PTEDUCAT, FDG, AV45, HippoNV, rs3818361, rs610932, and rs3865444. All 6 predictors in our final model (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444) are among these 8.
Passing the significance tests and fitting the data well only mean that nothing in the data contradicts the model; they do not establish a causal model, and other models may fit the data equally well.
Assumptions of the estimation: (1) the estimates of the regression parameters are independent (their correlations are zero); (2) the variances of the regression parameters are equal.
Interaction terms among the predictors could provide additional useful insights. However, choosing which interaction terms to include is difficult; careful and thorough analysis is required before adding them to the model.
Scatterplots help visualize the relationship between any variable and the outcome, and can reveal how that relationship changes with other variables.
For continuous predictors:
library(ggplot2)
library(GGally)
p <- ggpairs(AD[,c(17,1,3,4,5,6)], upper = list(continuous = "points")
, lower = list(continuous = "cor")
)
print(p)
For categorical predictors:
# Boxplot
library(ggplot2)
qplot(factor(PTGENDER), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(PTGENDER))
qplot(factor(rs3818361), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs3818361))
qplot(factor(rs11136000), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs11136000))
qplot(factor(rs744373), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs744373))
qplot(factor(rs610932), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs610932))
qplot(factor(rs3865444), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs3865444))
# Histogram
library(ggplot2)
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(PTGENDER))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs3818361))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs11136000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs744373))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD[,c(10,12,15,17)], geom = "histogram",
fill = factor(rs610932))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD[,c(10,12,15,17)], geom = "histogram",
fill = factor(rs3865444))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Using all variables:
library(rpart)
library(rpart.plot)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(ggplot2)
library(partykit)
## Loading required package: grid
theme_set(theme_gray(base_size = 15))
data <- AD_full
tree <- rpart(TOTAL13 ~ ., data, method="anova")
prp(tree, nn.cex = 1)
Because TOTAL13 is a continuous variable, the tree is built for regression, and the average of each pair of consecutive observed values of a predictor is considered as a candidate splitting value.
print(tree$variable.importance)
## FDG HippoNV AV45 AGE
## 10003.6912 4668.3064 2725.7887 968.0736
FDG has the largest importance score among all variables.
The complexity parameter cp controls model complexity. The tree can be pruned with the prune() function, which removes splits that do not improve the fit by at least cp; a larger cp therefore yields a less complex tree. We pruned with cp = 0.05 and then cp = 0.2 to compare the complexity of the resulting trees.
tree_0.05 <- prune(tree, cp = 0.05)
prp(tree_0.05, nn.cex = 1)
tree_0.2 <- prune(tree, cp = 0.2)
prp(tree_0.2, nn.cex = 1)
As the cp value becomes larger, the decision tree becomes smaller.
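Rather than comparing a few hand-picked cp values, rpart's cross-validated complexity table can guide the choice directly. A sketch on a built-in dataset (mtcars, not the AD data); the same printcp/plotcp calls apply to the AD tree above:

```r
library(rpart)

# Illustrative fit on mtcars; method = "anova" for a regression tree
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
printcp(fit)  # table of cp, number of splits, relative and cross-validated error
# Prune at the cp with the smallest cross-validated error (xerror):
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit, cp = best_cp)
```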
In order to find an adequate tree size, we conducted further analysis. We used half of the data points to train a decision tree and the rest for testing. Training and testing errors were computed for each tree structure, with the number of leaf nodes serving as the measure of tree complexity.
set.seed(1)
train_sample <- sample(nrow(data),floor( nrow(data)/2) )
errintrain <- NULL
errintest <- NULL
leaf.v <- NULL
for(i in seq(0.2, 0, by = -0.005)){
  tree <- rpart(TOTAL13 ~ ., data = data[train_sample,], cp = i)
  # Predictions are floored and compared for exact equality, so the "error"
  # below is the fraction of mismatched (discretized) predictions
  pred.train <- floor(predict(tree, data[train_sample,]))
  pred.test <- floor(predict(tree, data[-train_sample,]))
  current_error_train <- length(which(pred.train != data[train_sample,]$TOTAL13))/length(pred.train)
  current_error_test <- length(which(pred.test != data[-train_sample,]$TOTAL13))/length(pred.test)
  errintrain <- c(errintrain, current_error_train)
  errintest <- c(errintest, current_error_test)
  leaf.v <- c(leaf.v, length(which(tree$frame$var == "<leaf>")))
}
err.mat <- as.data.frame( cbind( train_err = errintrain, test_err = errintest , leaf_num = leaf.v ) )
err.mat$leaf_num <- as.factor( err.mat$leaf_num )
err.mat <- unique(err.mat)
err.mat <- err.mat %>% gather(type, error, train_err,test_err)
print(err.mat)
## leaf_num type error
## 1 2 train_err 0.9612403
## 2 3 train_err 0.9224806
## 3 4 train_err 0.9263566
## 4 5 train_err 0.9147287
## 5 6 train_err 0.9108527
## 6 8 train_err 0.9147287
## 7 18 train_err 0.9069767
## 8 21 train_err 0.8759690
## 9 2 test_err 0.9382239
## 10 3 test_err 0.9382239
## 11 4 test_err 0.9305019
## 12 5 test_err 0.9150579
## 13 6 test_err 0.9266409
## 14 8 test_err 0.9266409
## 15 18 test_err 0.9498069
## 16 21 test_err 0.9498069
The test and training errors for different numbers of leaf nodes are plotted. Training error generally decreases as the tree grows, while test error first decreases and then increases once the number of leaves exceeds 5. Therefore, an adequate number of leaf nodes is 5; larger trees are likely to overfit.
data.plot <- err.mat %>% mutate(type = type)
ggplot(data.plot, aes(x=leaf_num, y=error, shape = type, color=type)) + geom_line() +
geom_point(size=5)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
The final decision tree model, with 5 leaf nodes, can be selected with cp = 0.02:
tree <- rpart(TOTAL13 ~ ., data, method="anova") # refit on the full data (the loop above overwrote `tree`)
tree_0.020 <- prune(tree, cp = 0.02)
prp(tree_0.020, nn.cex = 1)
The decision tree uses 4 variables (FDG, HippoNV, AV45, and AGE) across its 5 leaf nodes, with FDG appearing twice with different thresholds. Three of these variables (FDG, HippoNV, and AV45) are shared with the final regression model, which uses 6 variables (PTEDUCAT, FDG, AV45, HippoNV, rs610932, rs3865444).
Mean squared error (MSE) can be used to compare the two models.
The MSE of our regression model is calculated as:
# install.packages("Metrics")
library(Metrics)
predictedvalue <- predict(lm.AD.F,data)
mse(data$TOTAL13,predictedvalue)
## [1] 36.17068
#double check the function mse() is working
mean((data$TOTAL13-predictedvalue)^2)
## [1] 36.17068
MSE of our decision tree model can be calculated as:
# install.packages("Metrics")
# library(Metrics)
predictedvalue <- predict(tree_0.020,data)
mse(data$TOTAL13,predictedvalue)
## [1] 37.1621
#double check the function mse() is working
mean((data$TOTAL13-predictedvalue)^2)
## [1] 37.1621
Since the MSE of the regression model is lower than that of the tree model, we would choose the regression model to minimize prediction error.
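Note that both MSEs above are computed on the same data used for fitting, which favors the more flexible model; a held-out comparison is fairer. A self-contained sketch of the idea on mtcars with a hypothetical formula (the same pattern applies to the AD data with the `train_sample` split):

```r
library(rpart)
set.seed(1)
idx  <- sample(nrow(mtcars), floor(nrow(mtcars) / 2))  # random half for training
test <- mtcars[-idx, ]
lm.half   <- lm(mpg ~ wt + hp, data = mtcars[idx, ])   # hypothetical formula
tree.half <- rpart(mpg ~ ., data = mtcars[idx, ], method = "anova")
mse_lm   <- mean((test$mpg - predict(lm.half, test))^2)
mse_tree <- mean((test$mpg - predict(tree.half, test))^2)
c(regression = mse_lm, tree = mse_tree)
```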
Find two data sets from the UCI data repository or R datasets. Conduct a detailed regression analysis for both datasets using both regression model and the tree model (for regression), e.g., for regression model, you may want to conduct model selection, model comparison, testing of the significance of the regression parameters, evaluation of the R-squared and significance of the model. Also comment on the application of your model on the context of the dataset you have selected.
The first dataset we chose for modeling is the cu.summary dataset that contains automobile data taken from the April, 1990 issue of “Consumer Reports”. The dataset contains 117 observations and 5 features. We are trying to predict the price of the car. A table of feature descriptions is provided below.
| Feature Name | Description |
|---|---|
| Price | a numeric vector giving the list price in US dollars of a standard model |
| Country | Country of origin |
| Reliability | an ordered factor with levels ‘Much worse’ < ‘worse’ < ‘average’ < ‘better’ < ‘Much better’ |
| Mileage | fuel consumption miles per US gallon |
| Type | a factor with levels Compact Large Medium Small Sporty Van |
We fit a linear regression model with all of the features to predict the price of the car. Looking at the summary, we see that this linear regression model explains 83.7% of the variability in the data.
lm.car <- lm(Price~., data=df.car)
summary(lm.car)
##
## Call:
## lm(formula = Price ~ ., data = df.car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3078.1 -1048.3 3.8 920.1 5341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18992.2 3389.0 5.604 3.42e-06 ***
## CountryJapan 1213.8 2719.7 0.446 0.65838
## CountryJapan/USA -154.6 2756.3 -0.056 0.95562
## CountryKorea -2203.6 2962.6 -0.744 0.46243
## CountryMexico -3176.7 3298.4 -0.963 0.34272
## CountrySweden 5558.9 3164.0 1.757 0.08850 .
## CountryUSA -2625.2 2566.9 -1.023 0.31411
## Reliabilitybetter 1853.7 1460.1 1.270 0.21340
## ReliabilityMuch better -913.6 1452.3 -0.629 0.53377
## ReliabilityMuch worse -415.1 1267.3 -0.328 0.74538
## Reliabilityworse -822.7 1266.5 -0.650 0.52057
## Mileage -265.3 129.7 -2.044 0.04921 *
## TypeLarge 5140.8 1686.4 3.048 0.00459 **
## TypeMedium 5431.7 1070.7 5.073 1.61e-05 ***
## TypeSmall -2100.3 1346.9 -1.559 0.12875
## TypeSporty 894.6 1282.7 0.697 0.49057
## TypeVan 1723.1 1780.6 0.968 0.34045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2091 on 32 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.837, Adjusted R-squared: 0.7555
## F-statistic: 10.27 on 16 and 32 DF, p-value: 1.899e-08
We use the stepAIC function to identify and discard features that do not improve the model. Looking at the output, we see that no features were discarded: removing any of them would increase the AIC. Thus our initial model is also the model selected by stepwise AIC.
Looking at the summary, the feature levels with positive coefficients (which increase the predicted price, holding the others fixed) include vehicle size (medium or large), the "better" reliability level, origin in Japan or Sweden, and the sporty and van vehicle types. The coefficients of the remaining feature levels are negative.
lm.car <- stepAIC(lm.car, direction="both")
## Start: AIC=762.39
## Price ~ Country + Reliability + Mileage + Type
##
## Df Sum of Sq RSS AIC
## <none> 139957482 762.39
## - Reliability 4 27132444 167089926 763.07
## - Mileage 1 18280336 158237818 766.40
## - Country 6 80914089 220871570 772.74
## - Type 5 168527460 308484942 791.11
summary(lm.car)
##
## Call:
## lm(formula = Price ~ Country + Reliability + Mileage + Type,
## data = df.car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3078.1 -1048.3 3.8 920.1 5341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18992.2 3389.0 5.604 3.42e-06 ***
## CountryJapan 1213.8 2719.7 0.446 0.65838
## CountryJapan/USA -154.6 2756.3 -0.056 0.95562
## CountryKorea -2203.6 2962.6 -0.744 0.46243
## CountryMexico -3176.7 3298.4 -0.963 0.34272
## CountrySweden 5558.9 3164.0 1.757 0.08850 .
## CountryUSA -2625.2 2566.9 -1.023 0.31411
## Reliabilitybetter 1853.7 1460.1 1.270 0.21340
## ReliabilityMuch better -913.6 1452.3 -0.629 0.53377
## ReliabilityMuch worse -415.1 1267.3 -0.328 0.74538
## Reliabilityworse -822.7 1266.5 -0.650 0.52057
## Mileage -265.3 129.7 -2.044 0.04921 *
## TypeLarge 5140.8 1686.4 3.048 0.00459 **
## TypeMedium 5431.7 1070.7 5.073 1.61e-05 ***
## TypeSmall -2100.3 1346.9 -1.559 0.12875
## TypeSporty 894.6 1282.7 0.697 0.49057
## TypeVan 1723.1 1780.6 0.968 0.34045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2091 on 32 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.837, Adjusted R-squared: 0.7555
## F-statistic: 10.27 on 16 and 32 DF, p-value: 1.899e-08
We now analyze our final linear regression model using the residual vs. fitted plot and the QQ-plot. The residual vs. fitted plot shows randomly scattered points, which suggests the regression assumptions have not been violated. The QQ-plot shows points clustered along the line, indicating that the residuals can plausibly be normally distributed.
par(mar = rep(2, 4)) # Change the margins
plot(lm.car, which=c(1))
plot(lm.car, which=c(2))
## Warning: not plotting observations with leverage one:
## 6, 12, 32
This model can be used to study which factors have the biggest impact on increasing the price of a vehicle. Understanding the extent in which certain features inflate the price of a vehicle can help manufacturers decide which areas they should focus on when attempting to decrease their prices.
A decision tree fitted to our dataset is shown below. The tree incorporates only two of the four features, the type and country of origin of each vehicle, and discards the reliability and mileage information. In contrast, the linear regression model retained all four features. This discrepancy suggests that each model captures different characteristics of the data.
library("rpart.plot")
tr.car <- rpart(Price~., data=df.car)
prp(tr.car, varlen=3)
The second dataset we chose for modeling is the Males dataset, which records the wages and education of young males. The dataset contains 4360 observations and 12 features. We are trying to predict the wage of the person. A table of feature descriptions is provided below.
| Feature Name | Description |
|---|---|
| nr | unique identifier |
| year | year data was collected |
| school | years of schooling |
| exper | years of experience |
| union | wage set by collective bargaining? |
| ethn | ethnicity |
| married | are they married? |
| health | health problems? |
| wage | log of hourly wage |
| industry | work industry |
| occupation | job occupation |
| residence | area of residence |
We first fit an initial linear regression model with all of the features to predict the wage of the individual. From looking at the summary, the initial linear regression model is able to explain 28.54% of the variability of the wage.
lm.males <- lm(wage~., data=df.males)
Next we performed feature selection using stepwise regression to fit a more parsimonious model. The final model produced by the stepwise regression dropped only the health feature. Comparing the models, the R-squared was reduced only marginally, to 28.53%. Most of the retained features are significant at the 0.05 level, although several individual industry and occupation levels are not. Looking at the coefficients, years of schooling, years of experience, union membership, ethnicity, and being married all have a positive association with wage per unit increase, holding the others constant. The remaining coefficients are a mix of positive and negative values depending on the specific categorical level.
lm.males <- stepAIC(lm.males, direction="both")
## Start: AIC=-4910.9
## wage ~ nr + year + school + exper + union + ethn + married +
## health + industry + occupation + residence
##
## Df Sum of Sq RSS AIC
## - health 1 0.141 630.89 -4912.2
## <none> 630.75 -4910.9
## - nr 1 1.394 632.14 -4906.0
## - exper 1 1.405 632.15 -4906.0
## - year 1 3.152 633.90 -4897.4
## - ethn 2 3.879 634.63 -4895.8
## - married 1 4.809 635.56 -4889.2
## - residence 3 9.557 640.30 -4870.1
## - occupation 8 15.155 645.90 -4852.9
## - union 1 15.827 646.57 -4835.7
## - school 1 28.968 659.71 -4773.0
## - industry 11 50.919 681.67 -4691.1
##
## Step: AIC=-4912.21
## wage ~ nr + year + school + exper + union + ethn + married +
## industry + occupation + residence
##
## Df Sum of Sq RSS AIC
## <none> 630.89 -4912.2
## + health 1 0.141 630.75 -4910.9
## - nr 1 1.389 632.28 -4907.4
## - exper 1 1.417 632.30 -4907.2
## - year 1 3.155 634.04 -4898.7
## - ethn 2 3.862 634.75 -4897.2
## - married 1 4.785 635.67 -4890.7
## - residence 3 9.567 640.45 -4871.3
## - occupation 8 15.338 646.23 -4853.4
## - union 1 16.019 646.91 -4836.1
## - school 1 29.104 659.99 -4773.7
## - industry 11 50.937 681.82 -4692.3
summary(lm.males)
##
## Call:
## lm(formula = wage ~ nr + year + school + exper + union + ethn +
## married + industry + occupation + residence, data = df.males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1925 -0.2273 0.0224 0.2568 2.2815
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -5.836e+01 1.487e+01
## nr -6.937e-06 2.662e-06
## year 2.957e-02 7.531e-03
## school 7.621e-02 6.389e-03
## exper 1.754e-02 6.666e-03
## unionyes 1.794e-01 2.028e-02
## ethnhisp 9.749e-02 3.459e-02
## ethnother 1.133e-01 2.620e-02
## marriedyes 8.652e-02 1.789e-02
## industryBusiness_and_Repair_Service 2.349e-01 6.760e-02
## industryConstruction 3.316e-01 6.837e-02
## industryEntertainment -1.451e-01 9.022e-02
## industryFinance 4.822e-01 7.558e-02
## industryManufacturing 3.814e-01 6.245e-02
## industryMining 4.450e-01 9.605e-02
## industryPersonal_Service 1.984e-01 8.611e-02
## industryProfessional_and_Related Service 6.897e-02 6.810e-02
## industryPublic_Administration 2.905e-01 7.309e-02
## industryTrade 1.485e-01 6.279e-02
## industryTransportation 4.458e-01 6.739e-02
## occupationCraftsmen, Foremen_and_kindred 5.734e-02 3.162e-02
## occupationFarm_Laborers_and_Foreman 1.046e-02 9.341e-02
## occupationLaborers_and_farmers -3.476e-02 3.710e-02
## occupationManagers, Officials_and_Proprietors 1.458e-01 3.611e-02
## occupationOperatives_and_kindred -3.779e-02 3.229e-02
## occupationProfessional, Technical_and_kindred 1.732e-01 3.580e-02
## occupationSales_Workers 6.662e-02 4.391e-02
## occupationService_Workers -5.474e-02 3.419e-02
## residencenothern_central -1.457e-01 2.285e-02
## residencerural_area -1.034e-01 5.429e-02
## residencesouth -1.242e-01 2.172e-02
## t value Pr(>|t|)
## (Intercept) -3.923 8.92e-05 ***
## nr -2.606 0.009199 **
## year 3.927 8.79e-05 ***
## school 11.928 < 2e-16 ***
## exper 2.632 0.008537 **
## unionyes 8.849 < 2e-16 ***
## ethnhisp 2.819 0.004855 **
## ethnother 4.325 1.57e-05 ***
## marriedyes 4.837 1.39e-06 ***
## industryBusiness_and_Repair_Service 3.475 0.000518 ***
## industryConstruction 4.849 1.30e-06 ***
## industryEntertainment -1.609 0.107776
## industryFinance 6.381 2.03e-10 ***
## industryManufacturing 6.108 1.14e-09 ***
## industryMining 4.633 3.76e-06 ***
## industryPersonal_Service 2.304 0.021311 *
## industryProfessional_and_Related Service 1.013 0.311247
## industryPublic_Administration 3.975 7.20e-05 ***
## industryTrade 2.365 0.018088 *
## industryTransportation 6.615 4.37e-11 ***
## occupationCraftsmen, Foremen_and_kindred 1.813 0.069878 .
## occupationFarm_Laborers_and_Foreman 0.112 0.910864
## occupationLaborers_and_farmers -0.937 0.348912
## occupationManagers, Officials_and_Proprietors 4.038 5.53e-05 ***
## occupationOperatives_and_kindred -1.170 0.242006
## occupationProfessional, Technical_and_kindred 4.839 1.37e-06 ***
## occupationSales_Workers 1.517 0.129320
## occupationService_Workers -1.601 0.109440
## residencenothern_central -6.375 2.10e-10 ***
## residencerural_area -1.905 0.056834 .
## residencesouth -5.719 1.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4523 on 3084 degrees of freedom
## (1245 observations deleted due to missingness)
## Multiple R-squared: 0.2853, Adjusted R-squared: 0.2783
## F-statistic: 41.03 on 30 and 3084 DF, p-value: < 2.2e-16
After obtaining this final model, we analyze the residual vs. fitted plot and the QQ-plot to check whether any of the linear regression assumptions are violated. In the residual vs. fitted plot, the points appear randomly distributed with no trend as the fitted values increase, so no violation is apparent. The QQ-plot, however, shows some deviation from the normality assumption.
plot(lm.males, which=c(1))
plot(lm.males, which=c(2))
This model could be used to study which features are most strongly correlated with a high wage. This could inform policy questions such as whether more education increases wages, and help study whether there are discrepancies among different socioeconomic factors.
A decision tree was fit to the data and the output of the model is shown below. Compared to the regression model, the decision tree incorporated only 5 of the 11 features: industry, years of schooling, year of data collection, marital status, and years of work experience. From analyzing the rules the decision tree produced, the observations with the highest wage are those with 12+ years of schooling who are not part of the agriculture, construction, entertainment, personal services, professional services, or transportation industries. This type of model can be used to detect underlying patterns in the data that a linear regression might not be able to ascertain, and to predict the wages of young professional males.
tr.males <- rpart(wage~.,data=df.males)
prp(tr.males, nn.cex=1)
Build a decision tree model based on the following dataset. Don’t use R. Use your pen and paper, and show the process.
df.q5 <- data.frame(x1=c(0.22,0.58,0.57,0.41,0.60,0.12,0.25,0.32),
x2=c(0.38,0.32,0.28,0.43,0.29,0.32,0.32,0.38),
y=factor(c("No","Yes","Yes","Yes","No","Yes","Yes","No")))
We chose an arbitrary rule of splitting at the quartiles. The first split is on x2 at the 3rd quartile, which gives \(x2 \leq 0.38\). The left node contains data points (1,2,3,5,6,7,8) and the right node contains data point (4).
\[e_{root} = -\frac{5}{8} log_{2}\frac{5}{8} - \frac{3}{8} log_{2}\frac{3}{8} = 0.9544\] \[e_{x2 \leq 0.38} = -\frac{4}{7} log_{2}\frac{4}{7} - \frac{3}{7} log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \gt 0.38} = -\frac{1}{1} log_{2}\frac{1}{1} - \frac{0}{1} log_{2}\frac{0}{1} = 0\] \[IG = e_{root} - \frac{7}{8}*e_{x2 \leq 0.38}-\frac{1}{8}*e_{x2 \gt 0.38} = 0.0924\]
The second split was done on the left node containing data points (1,2,3,5,6,7,8), on x2 at the 3rd quartile, which gives \(x2 \leq 0.35\). The left node contains data points (2,3,5,6,7) and the right node contains data points (1,8).
\[e_{root} = -\frac{4}{7} log_{2}\frac{4}{7} - \frac{3}{7} log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \leq 0.35} = -\frac{4}{5} log_{2}\frac{4}{5} - \frac{1}{5} log_{2}\frac{1}{5} = 0.7219\] \[e_{x2 \gt 0.35} = -\frac{0}{2} log_{2}\frac{0}{2} - \frac{2}{2} log_{2}\frac{2}{2} = 0\] \[IG = e_{root} - \frac{5}{7}*e_{x2 \leq 0.35}-\frac{2}{7}*e_{x2 \gt 0.35} = 0.4696\]
The third split was done on the left node containing data points (2,3,5,6,7), on x1 at the 3rd quartile, which gives \(x1 \leq 0.58\). The left node contains data points (2,3,6,7) and the right node contains data point (5).
\[e_{root} = -\frac{4}{5} log_{2}\frac{4}{5} - \frac{1}{5} log_{2}\frac{1}{5} = 0.7219\] \[e_{x1 \leq 0.58} = -\frac{4}{4} log_{2}\frac{4}{4} - \frac{0}{4} log_{2}\frac{0}{4} = 0\] \[e_{x1 \gt 0.58} = -\frac{0}{1} log_{2}\frac{0}{1} - \frac{1}{1} log_{2}\frac{1}{1} = 0\] \[IG = e_{root} - \frac{4}{5}*e_{x1 \leq 0.58}-\frac{1}{5}*e_{x1 \gt 0.58} = 0.7219\]
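The hand calculations above can be verified with a short helper (a sketch; `entropy` and `info_gain` are names we introduce here, taking class counts rather than raw data):

```r
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # class proportions, dropping zeros
  -sum(p * log2(p))
}
info_gain <- function(parent, children) {
  n <- sum(parent)
  entropy(parent) - sum(sapply(children, function(ch) sum(ch) / n * entropy(ch)))
}
round(entropy(c(5, 3)), 4)                            # 0.9544 (root)
round(info_gain(c(5, 3), list(c(4, 3), c(1, 0))), 4)  # 0.0924 (first split)
round(info_gain(c(4, 3), list(c(4, 1), c(0, 2))), 4)  # 0.4696 (second split)
```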
Write your own R script to implement the least squares estimation of a regression model. Compare the output from your script with the output from lm().
We arbitrarily choose the number of input variables as p = 3, and define the following function.
#least_squares: Returns the least squares estimation of a regression model
least_squares <- function(x1, x2, x3, y){
  # Create the design matrix X (with an intercept column) and the Y matrix
  x_matrix = as.matrix(cbind(rep(1, length(y)), x1, x2, x3))
  y_matrix = as.matrix(y)
  # Compute the beta estimator via the normal equations: (X'X)^(-1) X'Y
  beta_estimator = solve( t(x_matrix) %*% x_matrix ) %*% t(x_matrix) %*% y_matrix
  return(beta_estimator)
}
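Inverting \(X'X\) with solve() can be numerically unstable when predictors are highly correlated; a QR-based solve (the approach lm() uses internally) avoids forming \(X'X\) explicitly. A sketch, with `least_squares_qr` being a name we introduce here:

```r
least_squares_qr <- function(x1, x2, x3, y) {
  X <- cbind(1, x1, x2, x3)  # design matrix with intercept column
  qr.solve(X, y)             # minimizes ||y - X b|| via QR decomposition
}
```

On well-conditioned data it returns the same coefficients as the normal-equations version.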
We can now compare our function against the output from lm()
# Choose random x1, x2, x3 inputs
x1 = runif(1:20)
x2 = runif(1:20)
x3 = runif(1:20)
# Create an arbitrary y
y = 4 + 2*x1 + 1*x2 + 8*x3
# Call our function: least_squares
least_squares(x1, x2, x3, y)
## [,1]
## 4
## x1 2
## x2 1
## x3 8
# Calling the R function for linear regression: lm
lm(y~., data = data.frame(x1, x2, x3, y))
##
## Call:
## lm(formula = y ~ ., data = data.frame(x1, x2, x3, y))
##
## Coefficients:
## (Intercept) x1 x2 x3
## 4 2 1 8